Boosting the Efficiency in Similarity Search on Signature Collections

Author

  • Jong Wook Kim
Abstract

Computing all signature pairs whose bit differences are less than or equal to a given threshold in large signature collections is an important problem in many applications. In this paper, we leverage MapReduce-based parallelization to enable scalable similarity search on signatures. A roadblock to using the MapReduce framework for this problem, however, is that the cost of merging and sorting the intermediate key-value pairs produced by multiple mappers can be prohibitively expensive when they do not fit into main memory. We therefore propose S4igpart (Scalable Similarity Search on the Signatures), a novel MapReduce-based technique for similarity search over large signature collections. In particular, the approach relies on a data partitioning scheme that avoids costly disk-based merge and sort operations. Experimental results show that S4igpart significantly improves efficiency.

Keywords: MapReduce, Performance, Scalability, Signature, Similarity Search
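
To make the problem statement concrete, the following is a minimal single-machine sketch of the search being described: find every pair of signatures whose bit difference (Hamming distance) is at most a threshold. It is a brute-force baseline for illustration only, not the paper's MapReduce technique or its partitioning scheme; the function names `hamming` and `similar_pairs` and the sample signatures are assumptions introduced here.

```python
from itertools import combinations


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    # XOR leaves a 1 exactly where the two signatures disagree.
    return bin(a ^ b).count("1")


def similar_pairs(signatures, threshold):
    """Return index pairs (i, j), i < j, whose signatures differ
    in at most `threshold` bits (brute force, quadratic in the input)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(signatures), 2)
        if hamming(a, b) <= threshold
    ]


# Hypothetical 8-bit signatures, stored as Python integers.
sigs = [0b10110010, 0b10110011, 0b01001101, 0b10100010]
print(similar_pairs(sigs, 1))
```

Because this baseline compares every pair, its cost grows quadratically with the collection size, which is the kind of scaling pressure that motivates a parallel approach.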

Similar resources

PP-Index: Using Permutation Prefixes for Efficient and Scalable Approximate Similarity Search

We present the Permutation Prefix Index (PP-Index), an index data structure that supports efficient approximate similarity search. The PP-Index belongs to the family of permutation-based indexes, which represent each indexed object by “its view of the surrounding world”, i.e., a list of the elements of a set of reference objects sorted by their distance order with r...

Similarity Search in Sets and Categorical Data Using the Signature Tree

Data mining applications analyze large collections of set data and high dimensional categorical data. Search on these data types is not restricted to the classic problems of mining association rules and classification, but similarity search is also a frequently applied operation. Access methods for multidimensional numerical data are inappropriate for this problem and specialized indexes are ne...

Data as Ensembles of Records: Representation and Comparison

Many collections of data do not come packaged in a form amenable to the ready application of machine learning techniques. Nevertheless, there has been only limited research on the problem of preparing raw data for learning, perhaps because widespread differences between domains make generalization difficult. This paper focuses on one common class of raw data, in which the entities of interest a...

Fast indexing: a comparative evaluation

In this evaluation, the efficiency of three image signatures, called angular spectrum, Hough-based signature, and color histogram, is tested. The first signature is intrinsically hierarchical (deriving from the image frequency spectrum), and then no signature space reduction technique is used. The second is a short signature that is indexed directly, and the last (color histogram) needs to be reduced for fast indexin...

The new protocol blind digital signature based on the discrete logarithm problem on elliptic curve

In recent years, because the discrete logarithm problem on elliptic curves offers greater strength at lower computational cost than other hard problems, there have been attempts to use elliptic curve cryptography in applications such as blind digital signatures, replacing other methods based on the DLP. In this paper, a new blind digital signature scheme based on elliptic curve...

Publication date: 2013